Contributor

@ghukill ghukill commented Jan 29, 2026

Purpose and background context

This PR updates how we extract full-text for the mitlibwebsite. Analysis by DiscoEng identified HTML selectors that target the container elements holding page-specific text, excluding content that repeats on every page, such as headers and footers.

NOTE: much of the file churn comes from updated dependencies and updated linting, which are encapsulated in a single commit. The meaningful changes can be found in this commit.

How can a reviewer manually see the effects of these changes?

Please see this USE-365 Jira ticket comment that links to a spreadsheet analyzing the results of the fulltext field after these changes were implemented.

Includes new or updated dependencies?

YES

Changes expectations for external applications?

YES: reduction of repeating and unhelpful text in the mitlibwebsite will improve search relevancy and reduce noise in the USE interface.

What are the relevant tickets?

Code review

  • Code review best practices are documented here and you are encouraged to have a constructive dialogue with your reviewers about their preferences and expectations.

Why these changes are being introduced:

The full-text extracted from the full HTML of mitlibwebsite pages was too broad.
We were collecting header and footer data that was not unique to the record/URL
at hand.

How this addresses that need:

After some analysis by DiscoEng, URL + element selector patterns were identified
that target meaningful container elements. This dramatically reduces the amount of
full-text extracted while increasing its quality.

Side effects of this change:
* mitlibwebsite TIMDEX records have higher-quality fulltext field values

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/USE-364
Comment on lines +169 to +170
(True, {"class": "content-main"}), # True = wildcard element
(True, {"class": "main-content"}), # True = wildcard element
Contributor Author

This was new syntax to me in BeautifulSoup4 (BS4): you can use True to wildcard match any element.
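
For reference, a minimal sketch of that wildcard behavior, assuming `beautifulsoup4` is installed. The HTML snippet and variable names are made up for illustration and are not taken from the mitlibwebsite or from this PR's code:

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, for illustration only.
html = """
<html><body>
  <header class="site-header">Repeating header</header>
  <section class="content-main"><p>Page-specific text.</p></section>
  <footer class="site-footer">Repeating footer</footer>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Passing True as the tag name matches any element,
# so only the attribute filter (the class) decides the match.
container = soup.find(True, {"class": "content-main"})

tag_name = container.name
text = container.get_text(strip=True)
print(tag_name, "->", text)
```

Here the match is a `<section>`, but the same call would also find a `<div>` or `<main>` carrying that class, which is why the wildcard is convenient when the container's tag varies between pages.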

@ghukill ghukill marked this pull request as ready for review January 29, 2026 16:11
@ghukill ghukill requested a review from a team January 29, 2026 16:11
Contributor

@ehanson8 ehanson8 left a comment

Looks good, and a sensible change, since there is clearly a lot of non-useful content in the unrefined full text.

Contributor

Very helpfully formatted fixture!

Comment on lines +116 to +120
Using the full-text from the entire page will include far too much content that
is not unique or relevant to the page at hand, including repeating header and
footer data. Our approach may evolve over time, but this method aims to extract
only meaningful full-text from each record based on some simple rules and specific
container elements to look for.
Contributor

Great context
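
The approach in the docstring quoted above (simple rules plus specific container elements) could be sketched roughly as follows. This is an assumption-laden illustration, not the code merged in this PR: `CONTAINER_SELECTORS`, `extract_fulltext`, the selector ordering, and the `None` fallback are all hypothetical, and the URL-based part of the selector patterns is left out. Only the two class names come from the diff.

```python
from typing import Optional

from bs4 import BeautifulSoup

# Hypothetical selector list: (tag name, or True for any tag; attribute filter).
# Only the two class names appear in this PR; the ordering is assumed.
CONTAINER_SELECTORS = [
    (True, {"class": "content-main"}),
    (True, {"class": "main-content"}),
]


def extract_fulltext(html: str) -> Optional[str]:
    """Return text from the first matching container, or None if none match."""
    soup = BeautifulSoup(html, "html.parser")
    for name, attrs in CONTAINER_SELECTORS:
        container = soup.find(name, attrs)
        if container is not None:
            # Join text nodes with spaces, then collapse runs of whitespace.
            return " ".join(container.get_text(separator=" ").split())
    # Assumed fallback: no container found means no full-text.
    return None


# Hypothetical page with repeating header and footer around the real content.
sample_html = (
    "<html><body><header>Site header</header>"
    '<div class="main-content"><h1>Hours</h1><p>Open daily.</p></div>'
    "<footer>Site footer</footer></body></html>"
)
result = extract_fulltext(sample_html)
print(result)
```

Trying the selectors in order keeps the rules easy to extend: adding a new container pattern is a one-line change to the list.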

@ghukill ghukill merged commit 1addea6 into main Jan 30, 2026
3 checks passed